# Analyse Imports

This notebook:
- Analyses the most popular imports.
- Constructs a graph to determine the number of import 'hops' from a module to an ML-related import. 
- Categorises paths in each repo as ML0 (direct user of ML), ..., ML4 (indirect user of ML), unittests, or other.

Outputs:
- `popular.csv`
- `proj_labels.csv`

In [1]:
import pandas as pd
import numpy as np
from os.path import join
import pathlib
import altair as alt # Python wrapper for Vega-Lite visualisation grammar

In [2]:
DATA_DIR = "../output/"
NB_OUT = join(DATA_DIR, "notebooks_out")
MERGED_DIR = join(DATA_DIR, "merged")
pathlib.Path(NB_OUT).mkdir(parents=True, exist_ok=True)

In [3]:
proj_imports3 = pd.read_csv(join(MERGED_DIR, "results_imports_python3.csv"))
proj_imports3

Unnamed: 0.1,Unnamed: 0,repo,path,module_name,import_name,parse_error
0,0,190000321,190000321/setup.py,setup,codecs,False
1,1,190000321,190000321/setup.py,setup,os.path,False
2,2,190000321,190000321/setup.py,setup,setuptools,False
3,3,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.iterators.get_balanced_ba...,False
4,4,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.signal_target.SignalAndTa...,False
...,...,...,...,...,...,...
1329154,801996,38355280,38355280/lib/core/core.py,,,True
1329155,801997,38347270,38347270/main.py,,,True
1329156,801998,38347270,38347270/.buildozer/android/app/main.py,,,True
1329157,801999,38347270,38347270/.buildozer/android/app/sitecustomize.py,sitecustomize,os.path,False


In [4]:
proj_imports3.nunique()[["repo", "path", "module_name", "import_name"]]

repo             3932
path           236262
module_name    181758
import_name    210302
dtype: int64

In [5]:
len(proj_imports3[proj_imports3["parse_error"]])

17656

In [6]:
len(proj_imports3[pd.isnull(proj_imports3["module_name"])])

17656

## Identify Popular Libraries/Frameworks

To make it easier to identify popular imports (libraries/frameworks), we define the "import_short_name" to be just the top level of an import e.g. "sklearn.model_selection.train_test_split" => "sklearn".

In [7]:
proj_imports_all = proj_imports3

In [8]:
proj_imports_all["import_short_name"] = proj_imports_all["import_name"].apply(lambda x: x.split(".")[0] if not pd.isnull(x) else "")

In [9]:
proj_imports_all

Unnamed: 0.1,Unnamed: 0,repo,path,module_name,import_name,parse_error,import_short_name
0,0,190000321,190000321/setup.py,setup,codecs,False,codecs
1,1,190000321,190000321/setup.py,setup,os.path,False,os
2,2,190000321,190000321/setup.py,setup,setuptools,False,setuptools
3,3,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.iterators.get_balanced_ba...,False,braindecode
4,4,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.signal_target.SignalAndTa...,False,braindecode
...,...,...,...,...,...,...,...
1329154,801996,38355280,38355280/lib/core/core.py,,,True,
1329155,801997,38347270,38347270/main.py,,,True,
1329156,801998,38347270,38347270/.buildozer/android/app/main.py,,,True,
1329157,801999,38347270,38347270/.buildozer/android/app/sitecustomize.py,sitecustomize,os.path,False,os


In [10]:
# popularity of imports (based on number of *repos* that use them)
popular_imports = proj_imports_all.groupby(["import_short_name"])["repo"].nunique().sort_values(ascending=False)

In [11]:
popular_imports

import_short_name
os                    3097
                      2964
sys                   2689
numpy                 2029
time                  1981
                      ... 
multi_armed_bandit       1
multi                    1
mult_time                1
mulpyplexer              1
zzz                      1
Name: repo, Length: 16138, dtype: int64

In [12]:
popular_imports.to_csv(join(NB_OUT, "popular.csv"), header=True)

Note: The '' (empty string) entry, ranked second above, represents the absence of an import (e.g. a file with no imports, or a file that could not be parsed). It can be ignored.
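
As a minimal sketch of ignoring that placeholder entry, the snippet below ranks a few libraries after dropping the empty key. The counts are the five values shown in the output above; the plain `dict` stands in for the `popular_imports` Series and is an assumption made for illustration.

```python
# Repo counts keyed by import_short_name, copied from the output above.
# The "" key marks files with no imports or files that failed to parse.
repo_counts = {"os": 3097, "": 2964, "sys": 2689, "numpy": 2029, "time": 1981}

# Drop the placeholder key, then rank the remaining libraries by repo count.
ranked = sorted(
    ((lib, cnt) for lib, cnt in repo_counts.items() if lib != ""),
    key=lambda item: item[1],
    reverse=True,
)

for lib, cnt in ranked:
    print(f"{lib:6s} {cnt}")
```

On the Series itself, the same effect could likely be had with `popular_imports.drop("", errors="ignore")`.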

## Annotated Libs

In [13]:
# https://github.com/boalang/MSR19-DataShowcase/blob/master/info.txt

ml_libs = [
    "theano",
    #"pytroch", # boa contains a typo, we manually correct this
    "pytorch", # caveat: PyTorch is imported as "torch", so this name will not match an import_short_name
    "caffe", "keras", "tensorflow", "sklearn",
    "numpy", "scipy", "pandas", "statsmodels",
    "matplotlib", "seaborn", "plotly", "bokeh", "pydot",
    "xgboost", "catboost", "lightgbm", "eli5",
    "elephas", "spark", "nltk", "cntk", "scrapy", "gensim",
    "pybrain", "lightning", "spacy", "pylearn2",
    "nupic", "pattern", "imblearn", "pyenv"
]

In [14]:
popular_imports_df = pd.DataFrame({"lib": popular_imports.index, "cnt": popular_imports.values})
popular_imports_df["lib"] = popular_imports_df["lib"].replace('', 'None')
popular_imports_df

Unnamed: 0,lib,cnt
0,os,3097
1,,2964
2,sys,2689
3,numpy,2029
4,time,1981
...,...,...
16133,multi_armed_bandit,1
16134,multi,1
16135,mult_time,1
16136,mulpyplexer,1


In [15]:
popular_imports_df_ml = popular_imports_df[popular_imports_df["lib"].isin(ml_libs)]
popular_imports_df_ml

Unnamed: 0,lib,cnt
3,numpy,2029
16,tensorflow,941
20,matplotlib,871
22,scipy,813
33,sklearn,602
39,pandas,531
53,keras,427
97,theano,174
98,nltk,173
160,caffe,80


In [16]:
num_repos = proj_imports_all.nunique()["repo"]
num_repos 

3932

Note that in order to keep the figure size reasonable, we only show ML libs, rather than all imports:

In [17]:
chart = alt.Chart(popular_imports_df_ml).mark_bar().encode(
    x = alt.X('cnt', type='quantitative', title="Number of Python repos (out of %s) that import Lib" % num_repos, scale=alt.Scale(domain=(0,int(num_repos)))),
    y = alt.Y('lib', type='nominal', title="Lib (only ML libs shown)", sort=alt.EncodingSortField(order="descending")),
)
chart

In [18]:
#chart.save("imports.svg")

## Which ML libraries are used together in projects?

Some projects use a combination of libraries. Inspired by https://blog.bitergia.com/2018/04/02/a-preliminary-analysis-on-the-use-of-python-notebooks/, we examine which libraries are used together. (For now we analyse only combinations of ML libraries; otherwise there are too many combinations to make sense of.)

In [19]:
import_sets = []
proj_imports_grouped = proj_imports_all.groupby("repo")
for repo,repo_df in proj_imports_grouped:
    libs_set = tuple(sorted(set(repo_df["import_short_name"]) & set(ml_libs))) # empty tuple if none
    import_sets.append([repo, libs_set])

import_sets_df = pd.DataFrame.from_records(import_sets, columns=["repo", "importset"])

The ML-related libraries that each repo uses:

In [20]:
import_sets_df

Unnamed: 0,repo,importset
0,118130,"(matplotlib, nltk, numpy, scipy, sklearn, tens..."
1,192904,"(matplotlib, numpy, pandas, scipy, theano)"
2,329033,()
3,379988,"(numpy, scipy)"
4,462713,"(numpy, scipy, theano)"
...,...,...
3927,220350524,()
3928,222271895,()
3929,234515221,"(numpy,)"
3930,236706700,"(numpy,)"


Which combinations are most popular (note that a single import anywhere in the repo will count as usage):

In [21]:
import_sets_df.groupby("importset").count().rename(columns={"repo": "num_repos"}).sort_values("num_repos", ascending=False)

Unnamed: 0_level_0,num_repos
importset,Unnamed: 1_level_1
(),1722
"(numpy, tensorflow)",194
"(numpy,)",169
"(numpy, scipy, tensorflow)",84
"(matplotlib, numpy)",77
...,...
"(lightning, matplotlib, numpy, pandas, scipy, seaborn, sklearn)",1
"(lightning, matplotlib, numpy, scipy, sklearn)",1
"(cntk, numpy, pandas, tensorflow)",1
"(matplotlib, nltk)",1


The above shows that many repos mix ML libraries/frameworks.

A limitation of the analysis is that it doesn't distinguish between a library used in example code and a library used in a core part of the project. For example, SpaCy contains example code in `21467110/examples/` that imports keras and tensorflow, but these are not really part of SpaCy.
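
One possible way to reduce this effect, not applied in this notebook, would be to flag paths that sit under an example directory before computing the import sets. The sketch below assumes a hypothetical set of directory names (`EXAMPLE_DIRS`) and a hypothetical file name under `21467110/examples/`.

```python
from pathlib import PurePosixPath

# Assumed directory names that hold demo code rather than core code.
EXAMPLE_DIRS = {"examples", "example", "demo", "demos"}

def is_example_path(path: str) -> bool:
    """Return True if any directory below the repo id is an example dir."""
    parts = PurePosixPath(path).parts
    # parts[0] is the repo id and parts[-1] is the file name, so skip both.
    return any(part.lower() in EXAMPLE_DIRS for part in parts[1:-1])

# Hypothetical file name under the examples directory mentioned above.
in_examples = is_example_path("21467110/examples/demo_script.py")
# A path taken from this notebook that is not example code.
in_core = is_example_path("21467110/spacy/tests/regression/test_issue3531.py")
print(in_examples, in_core)
```

A directory-name heuristic like this would miss example code kept elsewhere, so it only narrows the limitation rather than removing it.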

## Identification of test code

One way to detect test code is to search the file path and the import name for the word "test".

In [22]:
proj_imports_test = proj_imports_all[proj_imports_all["path"].str.contains("test") | proj_imports_all["import_name"].str.contains("test")]

In [23]:
proj_imports_test

Unnamed: 0.1,Unnamed: 0,repo,path,module_name,import_name,parse_error,import_short_name
3,3,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.iterators.get_balanced_ba...,False,braindecode
4,4,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.signal_target.SignalAndTa...,False,braindecode
5,5,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.models.shallow_fbcsp.ShallowFBCSPNet,False,braindecode
6,6,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.torch_ext.util.np_to_var,False,braindecode
7,7,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.torch_ext.util.set_random_seeds,False,braindecode
...,...,...,...,...,...,...,...
1328948,801790,38386331,38386331/testing/test_version.py,test_version,pytest,False,pytest
1328949,801791,38386331,38386331/testing/test_version.py,test_version,setuptools_scm.config.Configuration,False,setuptools_scm
1328950,801792,38386331,38386331/testing/test_version.py,test_version,setuptools_scm.version.meta,False,setuptools_scm
1328951,801793,38386331,38386331/testing/test_version.py,test_version,setuptools_scm.version.simplified_semver_version,False,setuptools_scm


In [24]:
pd.Series(proj_imports_test["path"].unique())

0        190000321/test/acceptance_tests/from_notebooks...
1        190000321/test/acceptance_tests/from_notebooks...
2        190000321/test/acceptance_tests/from_notebooks...
3        190000321/test/unit_tests/datautil/test_trial_...
4         160251929/tensorflow_ranking/python/data_test.py
                               ...                        
61229                 38386331/testing/test_file_finder.py
61230                   38386331/testing/test_functions.py
61231                         38386331/testing/conftest.py
61232                   38386331/testing/test_basic_api.py
61233                     38386331/testing/test_version.py
Length: 61234, dtype: object

However, this approach may accidentally flag code related to model testing (in the ML sense of train/test) in addition to unittests (in the software engineering sense).
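
To make the false-positive problem concrete, here is a small sketch of the substring check applied to a handful of rows. The paths are hypothetical; only the import names are real library names.

```python
def flagged_by_substring(path: str, import_name: str) -> bool:
    """Mimic the naive check: 'test' anywhere in the path or the import name."""
    return "test" in path or "test" in import_name

# (path, import_name) pairs; all paths are made up for illustration.
rows = [
    ("repo/tests/test_utils.py", "unittest"),  # genuine software test
    ("repo/model/evaluate.py", "sklearn.model_selection.train_test_split"),  # ML train/test
    ("repo/data/load_testset.py", "numpy"),  # ML test set loader
    ("repo/setup.py", "setuptools"),  # unrelated
]

flags = [flagged_by_substring(path, name) for path, name in rows]
print(flags)
```

Only the first row is a software test, yet the second and third rows are flagged as well.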

A more conservative approach, which ensures we detect only software tests, is to flag code that imports a unittesting framework. Initially, we searched only for code that imported the Python `unittest` module; however, this failed to detect the testing code in SpaCy, which uses `pytest` instead.

A list of common Python unittesting frameworks was taken from:
https://docs.python-guide.org/writing/tests/

In [25]:
test_frameworks = ["unittest", "pytest", "unittest2", "mock"]

In [26]:
unittests = proj_imports_all[proj_imports_all["import_short_name"].isin(test_frameworks)]
unittests

Unnamed: 0.1,Unnamed: 0,repo,path,module_name,import_name,parse_error,import_short_name
63,63,190000321,190000321/test/unit_tests/datautil/test_trial_...,test_trial_segment,pytest,False,pytest
436,436,159175746,159175746/tests/utils/test_metrics.py,test_metrics,unittest,False,unittest
440,440,159175746,159175746/tests/utils/test_utils.py,test_utils,unittest,False,unittest
441,441,159175746,159175746/tests/utils/test_utils.py,test_utils,unittest.mock,False,unittest
445,445,159175746,159175746/tests/utils/test_datahandler.py,test_datahandler,unittest,False,unittest
...,...,...,...,...,...,...,...
1328917,801759,38386331,38386331/testing/test_file_finder.py,test_file_finder,pytest,False,pytest
1328921,801763,38386331,38386331/testing/test_functions.py,test_functions,pytest,False,pytest
1328935,801777,38386331,38386331/testing/conftest.py,conftest,pytest,False,pytest
1328942,801784,38386331,38386331/testing/test_basic_api.py,test_basic_api,pytest,False,pytest


In [27]:
paths_unittests = unittests.path.unique()
pd.Series(paths_unittests)

0        190000321/test/unit_tests/datautil/test_trial_...
1                    159175746/tests/utils/test_metrics.py
2                      159175746/tests/utils/test_utils.py
3                159175746/tests/utils/test_datahandler.py
4             159175746/tests/utils/test_trainer_helper.py
                               ...                        
26505                 38386331/testing/test_file_finder.py
26506                   38386331/testing/test_functions.py
26507                         38386331/testing/conftest.py
26508                   38386331/testing/test_basic_api.py
26509                     38386331/testing/test_version.py
Length: 26510, dtype: object

Importing a unittesting framework is a strong indication that the code is a software test. However, it may not be able to identify all tests.

In particular, the `pytest` framework only requires test code to follow naming conventions and use the `assert` statement. This means some unittests will not import `pytest` (or any other unittesting framework) at all, causing them to go undetected (e.g. `21467110/spacy/tests/regression/test_issue3531.py`).
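
A possible refinement, not used in this notebook, would be to combine the framework-import check with pytest's default file naming patterns (`test_*.py` and `*_test.py`). The function below is a hypothetical sketch; the set of imports passed for the SpaCy file is assumed, since its real imports are not shown here.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

TEST_FRAMEWORKS = {"unittest", "pytest", "unittest2", "mock"}

def looks_like_test(path: str, imported_top_levels: set) -> bool:
    """Flag a file if it imports a test framework or follows pytest naming."""
    file_name = PurePosixPath(path).name
    by_import = bool(TEST_FRAMEWORKS & imported_top_levels)
    by_name = fnmatchcase(file_name, "test_*.py") or fnmatchcase(file_name, "*_test.py")
    return by_import or by_name

# Caught by name only (the imports given here are an assumption).
spacy_regression = looks_like_test(
    "21467110/spacy/tests/regression/test_issue3531.py", {"spacy"}
)
# Not a test by either criterion.
setup_file = looks_like_test("190000321/setup.py", {"codecs", "os", "setuptools"})
print(spacy_regression, setup_file)
```

The name-based rule trades precision for recall: a file such as `data_test.py` would be flagged whether it holds software tests or ML evaluation code.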

## Identification of code that imports ML (ML₀), code that imports code that imports ML (ML₁), etc.

A Python module, `import_graph`, was written to provide the utilities needed for this part of the analysis (its source code and unittests are in this directory).
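
The idea of counting import 'hops' can be sketched as a breadth-first search over reversed import edges. This is an illustrative sketch only, not the `import_graph` implementation; the function name `ml_hops` and the toy module names are assumptions.

```python
from collections import defaultdict, deque

def ml_hops(imports: dict, ml_libs: set) -> dict:
    """Map each module to its minimum number of hops to an ML import.

    imports maps a module name to the set of names it imports.
    Modules with no import path to an ML library are omitted.
    """
    # Hop 0: modules that import an ML library directly.
    hops = {module: 0 for module, deps in imports.items() if deps & ml_libs}

    # Reverse edges: for each in-repo module, which modules import it.
    importers = defaultdict(set)
    for module, deps in imports.items():
        for dep in deps:
            if dep in imports:
                importers[dep].add(module)

    # Multi-source BFS outward from the hop-0 modules.
    queue = deque(sorted(hops))
    while queue:
        current = queue.popleft()
        for importer in sorted(importers[current]):
            if importer not in hops:
                hops[importer] = hops[current] + 1
                queue.append(importer)
    return hops

toy_imports = {
    "model": {"tensorflow", "utils"},
    "train": {"model"},
    "cli": {"train", "argparse"},
    "utils": {"os"},
}
result = ml_hops(toy_imports, {"tensorflow"})
print(result)
```

Because the search starts from all hop-0 modules at once, each module receives the length of its shortest chain to an ML import.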

In [28]:
import import_graph

Before proceeding any further, we remove unittest code. This ensures that code flagged as importing code that imports ML is actually part of the data processing pipeline rather than just unittests. It also reduces the risk of multiple paths sharing the same module name (which may cause problems when building the graph), as happens when unittests have the same name as the code that they test.

In [29]:
proj_imports_filt = proj_imports_all[~proj_imports_all["path"].isin(paths_unittests)]

In [30]:
proj_imports_filt

Unnamed: 0.1,Unnamed: 0,repo,path,module_name,import_name,parse_error,import_short_name
0,0,190000321,190000321/setup.py,setup,codecs,False,codecs
1,1,190000321,190000321/setup.py,setup,os.path,False,os
2,2,190000321,190000321/setup.py,setup,setuptools,False,setuptools
3,3,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.iterators.get_balanced_ba...,False,braindecode
4,4,190000321,190000321/test/acceptance_tests/from_notebooks...,test_trialwise_decoding,braindecode.datautil.signal_target.SignalAndTa...,False,braindecode
...,...,...,...,...,...,...,...
1329154,801996,38355280,38355280/lib/core/core.py,,,True,
1329155,801997,38347270,38347270/main.py,,,True,
1329156,801998,38347270,38347270/.buildozer/android/app/main.py,,,True,
1329157,801999,38347270,38347270/.buildozer/android/app/sitecustomize.py,sitecustomize,os.path,False,os


In [31]:
proj_imports_all_sanitized = import_graph.sanitize_modules(proj_imports_filt)

In [32]:
processed_df = import_graph.process_repos(proj_imports_all_sanitized, ml_libs)

In [33]:
processed_df[processed_df["hops"] == 0]

Unnamed: 0,path,module_name,hops,repo
0,118130/clustering/dpm.py,dpm,0.0,118130
1,118130/clustering/irm.py,irm,0.0,118130
2,118130/dnn/cdcgan-svhn.py,cdcgan-svhn,0.0,118130
3,118130/dnn/cgan-mnist.py,cgan-mnist,0.0,118130
4,118130/dnn/dcgan-svhn.py,dcgan-svhn,0.0,118130
...,...,...,...,...
515,209411984/tests/python/bl_bundled_modules.py,bl_bundled_modules,0.0,209411984
7,236706700/pysynth_b.py,pysynth_b,0.0,236706700
11,236706700/pysynth_e.py,pysynth_e,0.0,236706700
13,236706700/pysynth_s.py,pysynth_s,0.0,236706700


In [34]:
processed_df[processed_df["hops"] == 0].repo.unique().shape

(2039,)

In [35]:
processed_df[processed_df["hops"] == 1]

Unnamed: 0,path,module_name,hops,repo
3,192904/pymc3/backends/__init__.py,pymc3.backends.__init__,1.0,192904
5,192904/pymc3/backends/hdf5.py,pymc3.backends.hdf5,1.0,192904
7,192904/pymc3/backends/report.py,pymc3.backends.report,1.0,192904
13,192904/pymc3/distributions/__init__.py,pymc3.distributions.__init__,1.0,192904
36,192904/pymc3/examples/factor_potential.py,pymc3.examples.factor_potential,1.0,192904
...,...,...,...,...
19,205326746/tools/utils/__init__.py,tools.utils.__init__,1.0,205326746
86,207705426/data_capture/analysis/core.py,data_capture.analysis.core,1.0,207705426
1,236706700/menv.py,menv,1.0,236706700
15,236706700/read_abc.py,read_abc,1.0,236706700


In [36]:
processed_df[processed_df["hops"] == 1].repo.unique().shape

(744,)

In [37]:
processed_df[processed_df["hops"] == 2]

Unnamed: 0,path,module_name,hops,repo
66,192904/pymc3/smc/__init__.py,pymc3.smc.__init__,2.0,192904
49,379988/milk/tests/test_basic.py,milk.tests.test_basic,2.0,379988
35,1235740/examples/03_connectivity/plot_compare_...,plot_compare_decomposition,2.0,1235740
57,2233998/hyperspy/api_nogui.py,hyperspy.api_nogui,2.0,2233998
269,2233998/hyperspy/utils/__init__.py,hyperspy.utils.__init__,2.0,2233998
...,...,...,...,...
25,174000033/pyvisa/resources/usb.py,pyvisa.resources.usb,2.0,174000033
27,174000033/pyvisa/rname.py,pyvisa.rname,2.0,174000033
29,174000033/pyvisa/testsuite/test_errors.py,pyvisa.testsuite.test_errors,2.0,174000033
180,180872240/projects/imagenet/experiments/mixed_...,experiments.mixed_precision,2.0,180872240


In [38]:
processed_df[processed_df["hops"] == 2].repo.unique().shape

(263,)

In [39]:
processed_df[processed_df["hops"] == 3]

Unnamed: 0,path,module_name,hops,repo
56,2233998/hyperspy/api.py,hyperspy.api,3.0,2233998
8,6404963/blaze/compute/dask.py,blaze.compute.dask,3.0,6404963
10,6404963/blaze/compute/hdfstore.py,blaze.compute.hdfstore,3.0,6404963
11,6404963/blaze/compute/json.py,blaze.compute.json,3.0,6404963
26,6404963/blaze/compute/tests/test_chunks.py,test_chunks,3.0,6404963
...,...,...,...,...
517,159149096/pyscf/gto/cmd_args.py,pyscf.gto.cmd_args,3.0,159149096
20,174000033/pyvisa/resources/pxi.py,pyvisa.resources.pxi,3.0,174000033
21,174000033/pyvisa/resources/registerbased.py,pyvisa.resources.registerbased,3.0,174000033
26,174000033/pyvisa/resources/vxi.py,pyvisa.resources.vxi,3.0,174000033


In [40]:
processed_df[processed_df["hops"] == 3].repo.unique().shape

(95,)

In [41]:
processed_df[processed_df["hops"] == 4]

Unnamed: 0,path,module_name,hops,repo
142,2233998/hyperspy/logger.py,hyperspy.logger,4.0,2233998
43,20186184/databench_py/__init__.py,databench_py.__init__,4.0,20186184
238,40187375/multiqc/utils/config.py,multiqc.utils.config,4.0,40187375
1,40328394/angr/analyses/__init__.py,angr.analyses.__init__,4.0,40328394
12,40328394/angr/analyses/cfg/cfg.py,angr.analyses.cfg.cfg,4.0,40328394
...,...,...,...,...
281,130375797/python/dgl/model_zoo/chem/acnn.py,dgl.model_zoo.chem.acnn,4.0,130375797
285,130375797/python/dgl/model_zoo/chem/gnn.py,dgl.model_zoo.chem.gnn,4.0,130375797
298,130375797/python/dgl/model_zoo/chem/mpnn.py,dgl.model_zoo.chem.mpnn,4.0,130375797
123,131881622/PyTorch/Segmentation/MaskRCNN/pytorc...,maskrcnn_benchmark.modeling.detector.__init__,4.0,131881622


In [42]:
processed_df[processed_df["hops"] == 4].repo.unique().shape

(32,)

In [43]:
processed_df[processed_df["hops"] == 5]

Unnamed: 0,path,module_name,hops,repo
225,40187375/multiqc/plots/beeswarm.py,multiqc.plots.beeswarm,5.0,40187375
226,40187375/multiqc/plots/heatmap.py,multiqc.plots.heatmap,5.0,40187375
228,40187375/multiqc/plots/scatter.py,multiqc.plots.scatter,5.0,40187375
230,40187375/multiqc/plots/table_object.py,multiqc.plots.table_object,5.0,40187375
233,40187375/multiqc/templates/default_dev/__init_...,multiqc.templates.default_dev.__init__,5.0,40187375
...,...,...,...,...
41,104973687/myia/info.py,myia.info,5.0,104973687
47,104973687/myia/ir/metagraph.py,myia.ir.metagraph,5.0,104973687
194,104973687/myia/operations/utils.py,myia.operations.utils,5.0,104973687
196,104973687/myia/opt/cse.py,myia.opt.cse,5.0,104973687


In [44]:
processed_df[processed_df["hops"] == 5].repo.unique().shape

(12,)

In [45]:
processed_df[processed_df["hops"] == 6]

Unnamed: 0,path,module_name,hops,repo
9,40187375/multiqc/modules/bamtools/stats.py,multiqc.modules.bamtools.stats,6.0,40187375
63,40187375/multiqc/modules/deeptools/plotCorrela...,multiqc.modules.deeptools.plotCorrelation,6.0,40187375
67,40187375/multiqc/modules/deeptools/plotPCA.py,multiqc.modules.deeptools.plotPCA,6.0,40187375
133,40187375/multiqc/modules/peddy/peddy.py,multiqc.modules.peddy.peddy,6.0,40187375
168,40187375/multiqc/modules/rseqc/bam_stat.py,multiqc.modules.rseqc.bam_stat,6.0,40187375
183,40187375/multiqc/modules/samtools/flagstat.py,multiqc.modules.samtools.flagstat,6.0,40187375
215,40187375/multiqc/modules/vcftools/relatedness2.py,multiqc.modules.vcftools.relatedness2,6.0,40187375
18,40328394/angr/analyses/cfg/cfg_job_base.py,angr.analyses.cfg.cfg_job_base,6.0,40328394
23,40328394/angr/analyses/cfg/indirect_jump_resol...,angr.analyses.cfg.indirect_jump_resolvers.jump...,6.0,40328394
27,40328394/angr/analyses/cfg/indirect_jump_resol...,angr.analyses.cfg.indirect_jump_resolvers.x86_...,6.0,40328394


In [46]:
processed_df[processed_df["hops"] == 6].repo.unique().shape

(8,)

In [47]:
processed_df[processed_df["hops"] == 7]

Unnamed: 0,path,module_name,hops,repo
8,40187375/multiqc/modules/bamtools/bamtools.py,multiqc.modules.bamtools.bamtools,7.0,40187375
132,40187375/multiqc/modules/peddy/__init__.py,multiqc.modules.peddy.__init__,7.0,40187375
4,40328394/angr/analyses/binary_optimizer.py,angr.analyses.binary_optimizer,7.0,40328394
8,40328394/angr/analyses/calling_convention.py,angr.analyses.calling_convention,7.0,40328394
20,40328394/angr/analyses/cfg/indirect_jump_resol...,angr.analyses.cfg.indirect_jump_resolvers.__in...,7.0,40328394
...,...,...,...,...
136,104973687/myia/operations/prim_partial.py,myia.operations.prim_partial,7.0,104973687
143,104973687/myia/operations/prim_return_.py,myia.operations.prim_return_,7.0,104973687
181,104973687/myia/operations/prim_stop_gradient.py,myia.operations.prim_stop_gradient,7.0,104973687
184,104973687/myia/operations/prim_tagged.py,myia.operations.prim_tagged,7.0,104973687


In [48]:
processed_df[processed_df["hops"] == 7].repo.unique().shape

(6,)

In [49]:
processed_df[processed_df["hops"] == 8]

Unnamed: 0,path,module_name,hops,repo
7,40187375/multiqc/modules/bamtools/__init__.py,multiqc.modules.bamtools.__init__,8.0,40187375
33,40328394/angr/analyses/datagraph_meta.py,angr.analyses.datagraph_meta,8.0,40328394
40,40328394/angr/analyses/decompiler/optimization...,angr.analyses.decompiler.optimization_passes.d...,8.0,40328394
42,40328394/angr/analyses/decompiler/optimization...,angr.analyses.decompiler.optimization_passes.m...,8.0,40328394
43,40328394/angr/analyses/decompiler/optimization...,angr.analyses.decompiler.optimization_passes.m...,8.0,40328394
49,40328394/angr/analyses/decompiler/structurer.py,angr.analyses.decompiler.structurer,8.0,40328394
50,40328394/angr/analyses/disassembly.py,angr.analyses.disassembly,8.0,40328394
63,40328394/angr/analyses/identifier/__init__.py,angr.analyses.identifier.__init__,8.0,40328394
93,40328394/angr/analyses/loop_analysis.py,angr.analyses.loop_analysis,8.0,40328394
95,40328394/angr/analyses/propagator/__init__.py,angr.analyses.propagator.__init__,8.0,40328394


In [50]:
processed_df[processed_df["hops"] == 8].repo.unique().shape

(3,)

In [51]:
processed_df[processed_df["hops"] == 9]

Unnamed: 0,path,module_name,hops,repo
36,40328394/angr/analyses/decompiler/clinic.py,angr.analyses.decompiler.clinic,9.0,40328394
38,40328394/angr/analyses/decompiler/optimization...,angr.analyses.decompiler.optimization_passes._...,9.0,40328394
129,40328394/angr/annocfg.py,angr.annocfg,9.0,40328394
130,40328394/angr/blade.py,angr.blade,9.0,40328394
249,40328394/angr/knowledge_plugins/cfg/cfg_manage...,angr.knowledge_plugins.cfg.cfg_manager,9.0,40328394
318,40328394/angr/procedures/java_io/read.py,angr.procedures.java_io.read,9.0,40328394
319,40328394/angr/procedures/java_io/write.py,angr.procedures.java_io.write,9.0,40328394
331,40328394/angr/procedures/java_lang/character.py,angr.procedures.java_lang.character,9.0,40328394
332,40328394/angr/procedures/java_lang/double.py,angr.procedures.java_lang.double,9.0,40328394
333,40328394/angr/procedures/java_lang/exit.py,angr.procedures.java_lang.exit,9.0,40328394


In [52]:
processed_df[processed_df["hops"] == 9].repo.unique().shape

(1,)

In [53]:
processed_df[processed_df["hops"] == 10]

Unnamed: 0,path,module_name,hops,repo
3,40328394/angr/analyses/backward_slice.py,angr.analyses.backward_slice,10.0,40328394
24,40328394/angr/analyses/cfg/indirect_jump_resol...,angr.analyses.cfg.indirect_jump_resolvers.mips...,10.0,40328394
62,40328394/angr/analyses/girlscout.py,angr.analyses.girlscout,10.0,40328394


In [54]:
processed_df[processed_df["hops"] == 10].repo.unique().shape

(1,)

In [55]:
processed_df[processed_df["hops"] == 11]

Unnamed: 0,path,module_name,hops,repo


The deepest ML import chain had 10 hops of indirection, in the `angr` project. We label up to 4 hops of indirection, because the number of repos falls off sharply with each additional hop and only 12 repos had chains of 5 or more hops.
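
The drop-off can be checked directly from the per-hop repo counts printed in the cells above. The sketch below copies those counts into a `dict` and computes the ratio of each hop's count to the previous one; the variable names are assumptions.

```python
# Repo counts per hop distance, copied from the cell outputs above.
repos_by_hops = {0: 2039, 1: 744, 2: 263, 3: 95, 4: 32,
                 5: 12, 6: 8, 7: 6, 8: 3, 9: 1, 10: 1}

# Ratio of each hop's repo count to the previous hop's count.
ratios = {
    hops: repos_by_hops[hops] / repos_by_hops[hops - 1]
    for hops in sorted(repos_by_hops)
    if hops > 0
}

for hops, ratio in ratios.items():
    print(f"hops={hops:2d}  repos={repos_by_hops[hops]:4d}  ratio_to_previous={ratio:.2f}")
```

For hops 1 to 5, each step retains roughly a third of the repos from the previous step. The per-hop counts themselves could likely be produced in one cell with `processed_df.groupby("hops")["repo"].nunique()`.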

## Categorisation of each Python code file

We categorise each Python file as one of:
- (unit) test
- ML₀
- ...
- ML₄
- other
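
The labelling rule can be summarised as a single function: a test label takes precedence, then the hop count up to the cut-off, and everything else is "other". This is a hypothetical restatement of the cells that follow, using made-up paths; `categorise` and its arguments are not part of the notebook.

```python
def categorise(path: str, hops_by_path: dict, test_paths: set, max_hops: int = 4) -> str:
    """Return 'test', 'ml0'..'ml<max_hops>', or 'other' for one file path."""
    if path in test_paths:
        return "test"
    hops = hops_by_path.get(path)
    if hops is not None and hops <= max_hops:
        # Hop counts are stored as floats (e.g. 0.0), so convert for the label.
        return f"ml{int(hops)}"
    return "other"

# Hypothetical inputs for illustration.
hops_by_path = {"r/model.py": 0.0, "r/train.py": 1.0, "r/deep.py": 6.0}
test_paths = {"r/tests/test_model.py"}

labels = {
    path: categorise(path, hops_by_path, test_paths)
    for path in ["r/model.py", "r/train.py", "r/deep.py", "r/tests/test_model.py", "r/setup.py"]
}
print(labels)
```

Files more than 4 hops from an ML import fall into "other", which matches the cut-off chosen above.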

In [56]:
ml0 = processed_df[processed_df["hops"] == 0]["path"].unique()
ml1 = processed_df[processed_df["hops"] == 1]["path"].unique()
ml2 = processed_df[processed_df["hops"] == 2]["path"].unique()
ml3 = processed_df[processed_df["hops"] == 3]["path"].unique()
ml4 = processed_df[processed_df["hops"] == 4]["path"].unique()

In [57]:
test = paths_unittests

In [58]:
proj_labels = proj_imports_all[["repo", "path"]].copy()
proj_labels = proj_labels.drop_duplicates()

In [59]:
proj_labels["cat"] = "other"

In [60]:
proj_labels.loc[proj_labels["path"].isin(ml0), "cat"] = "ml0"
proj_labels.loc[proj_labels["path"].isin(ml1), "cat"] = "ml1"
proj_labels.loc[proj_labels["path"].isin(ml2), "cat"] = "ml2"
proj_labels.loc[proj_labels["path"].isin(ml3), "cat"] = "ml3"
proj_labels.loc[proj_labels["path"].isin(ml4), "cat"] = "ml4"

In [61]:
proj_labels.loc[proj_labels["path"].isin(test), "cat"] = "test"

In [62]:
proj_labels

Unnamed: 0,repo,path,cat
0,190000321,190000321/setup.py,other
3,190000321,190000321/test/acceptance_tests/from_notebooks...,ml0
16,190000321,190000321/test/acceptance_tests/from_notebooks...,ml0
32,190000321,190000321/test/acceptance_tests/from_notebooks...,ml0
57,190000321,190000321/test/unit_tests/datautil/test_trial_...,test
...,...,...,...
1329153,38355280,38355280/lib/core/exceptions.py,other
1329154,38355280,38355280/lib/core/core.py,other
1329155,38347270,38347270/main.py,other
1329156,38347270,38347270/.buildozer/android/app/main.py,other


In [63]:
proj_labels["cat"].value_counts()

other    160181
ml0       39789
test      26510
ml1        6365
ml2        2510
ml4         463
ml3         444
Name: cat, dtype: int64

In [64]:
proj_labels.to_csv(join(NB_OUT, "proj_labels.csv"), index=False)