# Queries on data from the Jupyter reproducibility study

`joined-data.feather` is created by `db-to-df.ipynb`.

We run several queries to sample subsets of repos, so that execution results with repo2docker can be compared with the results of [this prior study](https://zenodo.org/record/2592524).

In [1]:
import sys
import pandas as pd

Load the stored data, cached from an export of the above dataset.

In [2]:
%%time
df = pd.read_feather("joined-data.feather")
df

CPU times: user 1.98 s, sys: 693 ms, total: 2.67 s
Wall time: 2.7 s


Unnamed: 0,repo,commit,notebook,reason,exmode,exskip,nbskip,language,language_version,processed_execution,processed_notebook,processed_repo,setups_count,setups,requirements_count,requirements,pipfiles_count,pipfiles,pipfile_locks_count,pipfile_locks
0,tambetm/pgexperiments,ad4dada7dfe4c5fb8323597f73129157ede4b5fd,MNIST_PG_running_mean.ipynb,ImportError,5.0,0.0,0,python,2.7.14,39.0,131104,8329,0,,0,,0,,0,
1,JonnyRed/IPython-Notebooks,6675f32167646efbda1df575686de9becf3e4f85,YoungAndFreedman13/Chapter04-Newtons-Laws-of-M...,<Install Dependency Error>,3.0,0.0,0,python,2.7.8,0.0,8224,8329,1,SciPy_SymPy/ipython_doctester/setup.py,1,Numerical-Python/requirements.txt,0,,0,
2,vshaumann/Test,bee2b00d04e0366046e6b25820b43e6ae0a78405,KS Divergence.ipynb,,5.0,0.0,0,python,3.6.0,35.0,131104,8329,0,,0,,0,,0,
3,Henrilin28/ADS_Final_Homeless,57378e27768bb4627883eef243d892901a174a11,Time series plots/SingleWomenTimesSeries.ipynb,,5.0,0.0,0,python,2.7.12,35.0,131104,8329,0,,0,,0,,0,
4,luoyuweidu/Script,f0f0bcdfeef66312f73a5c0c9dabeb5fe406a699,ASSO - getting receipts.ipynb,error,5.0,0.0,0,python,2.7.13,39.0,131104,8329,0,,0,,0,,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1450072,cjwinchester/2018-03-09-nicar-class,7d05fe55e5181aa2c44d8989d2851d22196633e9,completed/18. Setting up Python on your own co...,,,,416,python,3.6.4,,32,8329,0,,1,requirements.txt,0,,0,
1450073,tjh1997/pandas_exercises,3da2d91d8b4080520d0bccd053fe17bd23c702eb,09_Time_Series/Getting_Financial_Data/Exercise...,,,,416,python,2.7.11,,32,8329,0,,0,,0,,0,
1450074,pzz2011/demos.ml,3e3f622408859f1f2f791c642f066d00d0e41d4b,jupyter/notebooks/Spark/SparkR/ApacheArrow-Fea...,,,,128,R,3.3.0,,32,8329,0,,0,,0,,0,
1450075,cfchang/cs591,18d5c7f3db291f8814ec5a2b1f5e9b66528cb44a,Homework-4/4.Food-recipes.ipynb,,,,96,unknown,unknown,,32,8329,0,,0,,0,,0,


In [3]:
df["language"] = df["language"].str.lower()
df.groupby("language").notebook.count().sort_values(ascending=False).head(10)

language
python      1334002
unknown       78021
r             16860
julia         12028
scala          1518
bash           1008
lua             698
ruby            655
octave          625
scala211        593
Name: notebook, dtype: int64

We are going to create task lists of repositories to test,
based on certain queries. To do this, we have a `save_grouped` function
that dumps a list of repos and commits, with info about the number of notebooks,
their language(s), and language versions.
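
The grouping step described above can be sketched on toy data. This is only an illustration: the `toy` frame and its values are invented, and only the column names (`repo`, `commit`, `notebook`, `language`) come from the dataset.

```python
import pandas as pd

# Invented rows, one per notebook; not taken from the study.
toy = pd.DataFrame(
    {
        "repo": ["a/x", "a/x", "b/y"],
        "commit": ["c1", "c1", "c2"],
        "notebook": ["n1.ipynb", "n2.ipynb", "n3.ipynb"],
        "language": ["python", "r", "python"],
    }
)

# One row per (repo, commit): number of notebooks and the set of languages,
# joined with ";" in sorted order.
summary = (
    toy.groupby(["repo", "commit"])
    .agg({"notebook": "count", "language": lambda s: ";".join(sorted(s.unique()))})
    .reset_index()
)
print(summary)
```

Each output row is one task: a repo at a specific commit, with enough context to see how many notebooks it contains and which kernels they use.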

So far, we have the following interesting queries:

1. Julia notebooks
2. R notebooks
3. repos where installation failed
4. repos where execution succeeded after dependency installation (Python-only because the prior study only executed Python)
5. repos where execution succeeded with Anaconda, without dependencies specified (Python-only)
6. repos where execution succeeded *and* environment is specified with top-level `requirements.txt` (subset of 4 where we know repo2docker will find the requirements.txt)
7. repos where failure is due to import error or ModuleNotFound, indicating a missing dependency

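The database stores the execution status as a bitmask in the `executions.processed` column, which appears here as `processed_execution`. Each bit corresponds to one of the labels below.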

In [4]:
status_labels = {
    0b000000: "failed to install",
    0b000001: "installed",
    0b000010: "loaded",
    0b000100: "exception",
    0b001000: "timeout",
    0b010000: "same results",
    0b100000: "ok",
}
status = {label:mask for mask, label in status_labels.items()}
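
To make the bitmask concrete, here is a minimal sketch that decodes a stored value into its labels. `decode_status` is a hypothetical helper, not part of the notebook, and `STATUS_LABELS` simply repeats the mapping above. Note that "failed to install" is the value 0, so it has to be tested by equality rather than with `&`.

```python
# Same bit assignments as the status_labels mapping above.
STATUS_LABELS = {
    0b000001: "installed",
    0b000010: "loaded",
    0b000100: "exception",
    0b001000: "timeout",
    0b010000: "same results",
    0b100000: "ok",
}


def decode_status(value):
    """Return the labels whose bits are set in value (hypothetical helper)."""
    if value == 0:
        # No bits set: the zero value is the "failed to install" label.
        return ["failed to install"]
    return [label for mask, label in STATUS_LABELS.items() if value & mask]


print(decode_status(39))  # 39 == 0b100111
print(decode_status(35))  # 35 == 0b100011
```

The values 39 and 35 both appear in the first rows of the table above: 39 has the exception bit set, 35 does not.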

In [5]:
random_seed = 1 # set random seed for reproducible output


def agg_unique(x):
    """Aggregator to return semicolon-separated unique values"""
    return ";".join(sorted(x.unique()))


def save_grouped(df, dest, limit=None, sort=True):
    """Save a subset of data to a .txt file,
    
    grouped by repo, showing set of languages and number of notebooks
    """
    grouped = df.groupby(["repo", "commit"]).agg(
        {
        "notebook": "count",
        "language": agg_unique,
        "language_version": agg_unique,
        }
    ).reset_index()
    if sort:
        grouped = grouped.sort_values("notebook")
    else:
        # random order
        grouped = grouped.sample(frac=1, random_state=random_seed)
    with open(dest, "w") as f:
        f.write(f"{'repo'.rjust(40)}, {'commit'.rjust(40)}, notebooks, language, version\n")
        limited = grouped[:limit]
        for idx, row in limited.iterrows():
            f.write(f"{row.repo.rjust(40)}, {row.commit}, {row.notebook:9}, {row.language.rjust(9)}, {row.language_version.rjust(7)}\n")
    print(f"Wrote {len(limited)}/{len(grouped)} records to {dest}")
            

def matching_repos(mask, mode='any'):
    """Expand a subset applied at the notebook or execution level to all records
    for matching repos.
    
    Match mode can be 'any' (any records for a given repo match)
    or 'all' (all records for a given repo match)
    
    Implemented with isin() on unique repo names
    because groupby.filter(mask.any()) takes 100 times as long.
    """
    if mode == 'all':
        # De Morgan: all = not (not any)
        not_any = df[~mask]
        unmatching_repos = not_any.repo.unique()
        return df[~df.repo.isin(unmatching_repos)]
    elif mode == 'any':
        subset = df[mask]
        matching_repos = subset.repo.unique()
        return df[df.repo.isin(matching_repos)]
    else:
        raise ValueError(f"mode must be 'any' or 'all', not '{mode}'")
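
The two match modes can be checked on a small example. The sketch below uses an invented `toy` frame and compares the `isin`-based approach with the slower `groupby.filter` formulation mentioned in the docstring. It illustrates the idea only and is not a benchmark.

```python
import pandas as pd

# Invented data: repo "a" is mixed, "b" is Julia-only, "c" has no Julia.
toy = pd.DataFrame(
    {
        "repo": ["a", "a", "b", "b", "c"],
        "language": ["julia", "python", "julia", "julia", "python"],
    }
)
mask = toy["language"] == "julia"

# mode='any': keep all rows of repos with at least one matching row.
any_repos = toy.loc[mask, "repo"].unique()
any_rows = toy[toy["repo"].isin(any_repos)]

# mode='all': drop every repo that has at least one non-matching row.
unmatching = toy.loc[~mask, "repo"].unique()
all_rows = toy[~toy["repo"].isin(unmatching)]

# Equivalent but slower formulation of mode='all'.
all_rows_slow = toy.groupby("repo").filter(
    lambda g: (g["language"] == "julia").all()
)

print(sorted(set(any_rows["repo"])))
print(sorted(set(all_rows["repo"])))
```

With `mode='any'` the mixed repo is kept along with its non-Julia rows. With `mode='all'` only the Julia-only repo survives.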

## Julia notebooks

Collect all of the repositories with at least one Julia notebook.

In [6]:
julia = matching_repos(df["language"] == "julia")
print(f"{len(julia.repo.unique())} repos containing some Julia notebooks")

2328 repos containing some Julia notebooks


In [7]:
all_julia = matching_repos(df["language"] == "julia", mode='all')
print(f"{len(all_julia.repo.unique())} repos containing only Julia notebooks")

1630 repos containing only Julia notebooks


In [8]:
save_grouped(all_julia, "julia.txt", limit=1000)

Wrote 1000/1630 records to julia.txt


## R notebooks

Same as for Julia, but for R.

In [9]:
r = matching_repos(df["language"] == "r", mode="all")
print(f"{len(r.repo.unique())} repos containing only R notebooks")
save_grouped(r, "r.txt", limit=1000, sort=True)

2501 repos containing only R notebooks
Wrote 1000/2501 records to r.txt


## Failed installation

A failed installation occurs when the execution mode (`exmode`) is 3 (install with dependencies) and `processed_execution` is 0.

In [10]:
install_failed_mask = (df["processed_execution"] == status["failed to install"]) & (df["exmode"] == 3)
install_failed = matching_repos(install_failed_mask)
save_grouped(install_failed, "install-failed.txt", sort=False, limit=1000)

Wrote 1000/11877 records to install-failed.txt


## Success

We explore success under two execution modes:

- exmode == 3 (found and installed dependencies)
- exmode == 5 (found no dependencies, ran with Anaconda)

In [11]:
df["exmode"].unique()

array([ 5.,  3., nan,  4.])

Extract successful runs from the `processed_execution` bitmask. A run counts as successful when:

- the OK bit is set, AND
- the Exception bit is NOT set, AND
- the Timeout bit is NOT set
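
The test can be worked through on single integers. In this minimal sketch `is_success` is a hypothetical helper, and the three constants repeat the bit values from `status_labels`.

```python
OK, EXCEPTION, TIMEOUT = 0b100000, 0b000100, 0b001000


def is_success(processed):
    """True when the ok bit is set and neither exception nor timeout is."""
    relevant = OK | EXCEPTION | TIMEOUT  # 0b101100 == 44
    # Masking discards the bits we do not care about (installed, loaded,
    # same results); what remains must be exactly the ok bit.
    return (processed & relevant) == OK


print(is_success(35))  # True:  35 & 44 == 32
print(is_success(39))  # False: 39 & 44 == 36 (exception bit is set)
```

Comparing against `OK` after masking checks all three conditions in one step, which is why the cell below needs only a single `==`.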

In [12]:
processed_mask = ~df["processed_execution"].isna()
without_unprocessed = df.loc[processed_mask]
# success mask: ok AND NOT exception or timeout
success_mask = (
    without_unprocessed["processed_execution"].astype(int) & (
        status["ok"] | status["exception"] | status["timeout"]
    ) == status["ok"]
)
# define new boolean success column with the result
df.loc[processed_mask, "success"] = success_mask
# recreate the view with new column
without_unprocessed = df.loc[processed_mask]

Execution success rate with installed dependencies:

In [13]:
rate = df[df.exmode == 3].success.mean()
print(f"success rate after successful installation of dependencies: {round(rate * 100, 1)}%")

success rate after successful installation of dependencies: 7.6%


And the success rate for notebooks without specified dependencies, run in Anaconda:

In [14]:
rate = df[df.exmode == 5].success.mean()
print(f"success rate without specified dependencies, run in Anaconda: {round(rate * 100, 1)}%")

success rate without specified dependencies, run in Anaconda: 24.9%


Select the "success with dependencies" runs where
at least one `requirements.txt` is present.
These include the repos we should expect repo2docker to understand.

In [15]:
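# NOTE: `& df.requirements_count` is a bitwise AND with an integer column,
# so it appears to keep only rows with an odd count (see the distribution below);
# `df.requirements_count > 0` would keep every row with at least one file.
req_success = df[
    (df.exmode == 3)
    & df.success
    & df.requirements_count
]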
print(f"{len(req_success)} notebooks successfully executed with some requirements.txt specified")

9623 notebooks successfully executed with some requirements.txt specified


Count the fraction of these where the only requirements file is a top-level `requirements.txt`:

In [16]:
rate = (req_success["requirements"] == "requirements.txt").mean()
print(f"Fraction of successful executions with only standard top-level requirements.txt: {rate * 100:.0f}%")

Fraction of successful executions with only standard top-level requirements.txt: 54%


The number of repos with successful runs, by number of requirements.txt files:

In [17]:
req_success.groupby("requirements_count").repo.agg(lambda x: len(x.unique())).sort_index()

requirements_count
1     2649
3       58
5       26
7       12
9       16
11     265
13       9
15       1
17       1
33       1
Name: repo, dtype: int64
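
Every count in the table above is odd, which suggests a filtering effect rather than a property of the repos. A likely explanation, offered here as an assumption rather than something the notebook states, is that `& df.requirements_count` in cell 15 acts as a bitwise AND between a boolean and an integer, which keeps a row only when the lowest bit of the count is set. The pure-Python sketch below (both helper names are hypothetical) contrasts that with an explicit `> 0` test.

```python
def keeps_row_bitwise(flag, count):
    """Mimic `flag & count` followed by a cast to bool."""
    return bool(flag & count)


def keeps_row_positive(flag, count):
    """Explicit test for 'at least one requirements file'."""
    return flag and count > 0


for count in [0, 1, 2, 3, 11]:
    print(count, keeps_row_bitwise(True, count), keeps_row_positive(True, count))
```

For a count of 2 the bitwise form drops the row while the explicit comparison keeps it. If the assumption holds, repos with an even number of requirements files are missing from `req_success`.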

The same data as above, grouped by layout.
Many of the multi-requirements repos share an identical layout,
as is typical of repos created for online courses.

In [18]:
req_success.groupby("requirements").repo.agg(lambda x: len(x.unique())).sort_values().tail()

requirements
sentiment-network/requirements.txt;weight-initialization/requirements.txt;intro-to-tflearn/requirements.txt;tensorboard/requirements.txt;intro-to-tensorflow/requirements.txt;first-neural-network/requirements.txt;transfer-learning/requirements.txt;sentiment-rnn/requirements.txt;embeddings/requirements.txt;tv-script-generation/requirements.txt;intro-to-rnns/requirements.txt      52
requirements/requirements.txt                                                                                                                                                                                                                                                                                                                                                               52
aimacode/requirements.txt                                                                                                                                                                                                    

Now create lists of repos with complete success:

1. success with dependencies
2. success with dependencies where there is exactly one top-level requirements.txt
3. success without dependencies (with Anaconda); we don't expect much success with repo2docker here

In [19]:
%%time
successful_repos = matching_repos(df["success"] & (df["exmode"] == 3), mode="all")
save_grouped(successful_repos, "success-with-dependencies.txt", sort=True)

Wrote 971/971 records to success-with-dependencies.txt
CPU times: user 1.45 s, sys: 159 ms, total: 1.61 s
Wall time: 1.72 s


In [20]:
success_default_requirements = successful_repos[successful_repos["requirements"] == "requirements.txt"]
save_grouped(success_default_requirements, "success-default-requirements.txt", sort=True)

Wrote 798/798 records to success-default-requirements.txt


In [21]:
success_without_dependencies = matching_repos(
    df["success"] & (df["exmode"] == 5),
    mode="all",
)
save_grouped(success_without_dependencies, "success-without-dependencies.txt", sort=False, limit=1000)

Wrote 1000/25379 records to success-without-dependencies.txt


## Import errors

Collect repos where some failure is due to an ImportError or ModuleNotFound, indicating a missing dependency.


In [22]:
# na=False treats rows with no recorded reason as non-matching
import_error_mask = df["reason"].str.contains("ImportError|ModuleNotFound", na=False)
import_errors = matching_repos(import_error_mask)
save_grouped(import_errors, "import-error.txt", limit=1000, sort=False)

Wrote 1000/95616 records to import-error.txt
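
The `reason` column is empty for many rows (see the table near the top), so a string match on it has to decide what to do with missing values. This is a minimal sketch on an invented `reasons` Series. `na=False` is a real `str.contains` parameter that makes missing entries count as non-matching.

```python
import pandas as pd

# Invented failure reasons, including one missing value.
reasons = pd.Series(["ImportError", None, "error", "ModuleNotFoundError"])

# The pattern is a regular expression; "|" matches either alternative.
mask = reasons.str.contains("ImportError|ModuleNotFound", na=False)
print(mask.tolist())  # [True, False, False, True]
```

Without `na=False` the missing entry would propagate into the mask as a missing value instead of `False`, which is awkward to use for boolean indexing.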
